1 Introduction

Maximum likelihood (MLE) and method of moments (MM) are two common methods for constructing a model.

2 Background

I will explain to the reader how one might use MLE and MM to model (a) Glycohemoglobin and (b) Height of adult females. The data will be from National Health and Nutrition Examination Survey 2009-2010 (NHANES), available from the Hmisc package. I will compare and contrast the two methods in addition to comparing and contrasting the choice of underlying distribution.

3 Method

3.1 Data

#Import data
require(dplyr)
Hmisc::getHdata(nhgh)
d1 <- nhgh %>% 
  filter(sex == "female") %>% 
  filter(age >= 18) %>% 
  select(gh, ht) %>% 
  filter(1:n()<=1000)

4 MLE of Height

4.1 Normal Distribution

4.1.1 Estimates of parameters

#HT,Normal, MLE
ll <- function(params, data)  sum(dnorm(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$ht)
g <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(x[i],7),d1$ht)
  }
  out
}

h <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(160,x[i]),d1$ht)
  }
  out
}

In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.

require(stats4)
nLL <- function(mean, sd){
  fs <- dnorm(
        x = d1$ht
      , mean = mean
      , sd = sd
      , log = TRUE
    ) 
  -sum(fs)
}
param_hat <- mle(
    nLL
  , start = list(mean = 160, sd = 5)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2)); plot(profile(param_hat), absVal = FALSE)

In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.

coef(param_hat)
##       mean         sd 
## 160.741906   7.316505

This is the specific value of the two parameters.

4.1.2 Overlay estimated pdf onto histogram

hist(d1$ht,breaks = 100, freq = FALSE)
curve(dnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution.

4.1.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$ht))
curve(pnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.

4.1.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qnorm(qs,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.

4.1.5 Estimated Median

qnorm(0.5,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
## [1] 160.7419

The estimated Median is 160.7419064. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

4.1.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rnorm(M*N,mean=coef(param_hat)[1],sd=coef(param_hat)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

4.1.7 Range of middle 95% of Samp Dist

#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 160.1659 161.2980
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)

The Range of middle 95% of Samp Dist is from 160.1658828 to 161.2979761. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

4.2 Gamma

4.2.1 Estimates of parameters

#HT,Gamma, MLE
ll <- function(params, data)  sum(dgamma(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$ht)
g <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(x[i],0.3),d1$ht)
  }
  out
}

h <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(480,x[i]),d1$ht)
  }
  out
}

In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.

require(stats4)
nLL_gamma <- function(shape, scale){
  fs <- dgamma(
        x = d1$ht
      , shape = shape
      , scale = scale
      , log = TRUE
    ) 
  -sum(fs)
}
param_hat_gamma <- mle(
    nLL_gamma
  , start = list(shape = 480, scale = 0.3)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2)); plot(profile(param_hat_gamma), absVal = FALSE)

In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.

coef(param_hat_gamma)
##       shape       scale 
## 480.0000205   0.3348802

This is the specific value of the two parameters.

4.2.2 Overlay estimated pdf onto histogram

hist(d1$ht,breaks = 100, freq = FALSE)
curve(dgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution.

4.2.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$ht))
curve(pgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.

4.2.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qgamma(qs,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.

4.2.5 Estimated Median

qgamma(0.5,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
## [1] 160.6309

The estimated Median is 160.6308885. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

4.2.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rgamma(M*N,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

4.2.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 160.0639 161.1975

The Range of middle 95% of Samp Dist is from 160.0638804 to 161.1975156. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

4.3 Weibull

4.3.1 Estimates of parameters

#HT,Gamma, MLE
ll <- function(params, data)  sum(dweibull(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$ht)
g <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(x[i],165),d1$ht)
  }
  out
}

h <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(22,x[i]),d1$ht)
  }
  out
}

In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.

require(stats4)
nLL_weibull <- function(shape, scale){
  fs <- dweibull(
        x = d1$ht
      , shape = shape
      , scale = scale
      , log = TRUE
    ) 
  -sum(fs)
}
param_hat_weibull <- mle(
    nLL_weibull
  , start = list(shape = 22, scale = 165)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2)); plot(profile(param_hat_weibull), absVal = FALSE)

In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.

coef(param_hat_weibull)
##     shape     scale 
##  21.85396 164.24718

This is the specific value of the two parameters.

4.3.2 Overlay estimated pdf onto histogram

hist(d1$ht,breaks = 100, freq = FALSE)
curve(dweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution.

4.3.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$ht))
curve(pweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.

4.3.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qweibull(qs,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We maynot say that the sample quantile fits the theoretical quantile very well in the beginning of the figure.

4.3.5 Estimated Median

qweibull(0.5,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
## [1] 161.5156

The estimated Median is 161.5155603. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

4.3.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rweibull(M*N,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

4.3.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 160.8373 162.1827

The Range of middle 95% of Samp Dist is from 160.8372666 to 162.1826811. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

4.4 Conclusion

In this part, we use MLE to model Height of adult females and we use three different distributions. The content above answers questions in Background part.

5 MM of Height

5.1 Normal

5.1.1 Estimates of parameters

According to the feature of normal distribution.E[X]=mean(sample) and V[X]= Var(sample). Thus, we can conclude two parameters below.

mm.mean=mean(d1$ht)
mm.sd=sd(d1$ht)

This is the specific value of the two parameters.

5.1.2 Overlay estimated pdf onto histogram

hist(d1$ht,breaks = 100,freq = F)
curve(dnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution.

5.1.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$ht))
curve(pnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.

5.1.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qnorm(qs,mm.mean,mm.sd)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.

5.1.5 Estimated Median

qnorm(0.5,mean=mm.mean,sd=mm.sd)
## [1] 160.7419

The estimated Median is 160.7419. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

5.1.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rnorm(M*N,mean=mm.mean,sd=mm.sd) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

5.1.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 160.1787 161.3185

The Range of middle 95% of Samp Dist is from 160.178736 to 161.3185258. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

5.2 Gamma

5.2.1 Estimates of parameters

According to the feature of Gamma distribution.E[X]=shape∗scale and V[X]=shape∗scale^2. Thus, we can conclude two parameters below. scale=mean(sample)/var(sample) and shape= mean(sample)^2 / mean(sample).

mm_shape_gamma=mean(d1$ht)^2/var(d1$ht)
mm_scale_gamma=var(d1$ht)/mean(d1$ht)

This is the specific value of the two parameters.

5.2.2 Overlay estimated pdf onto histogram

hist(d1$ht,breaks = 100,freq = F)
curve(dgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution.

5.2.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$ht))
curve(pgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.

5.2.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qgamma(qs,shape=mm_shape_gamma,scale=mm_scale_gamma)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max fits a straight line with y=x. We can say that the sample quantile fits the theoretical quantile.

5.2.5 Estimated Median

qgamma(0.5,shape=mm_shape_gamma,scale=mm_scale_gamma)
## [1] 160.6308

The estimated Median is 160.630794. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

5.2.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rgamma(M*N,shape=mm_shape_gamma,scale=mm_scale_gamma) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

5.2.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 160.0672 161.1875

The Range of middle 95% of Samp Dist is from 160.0671567 to 161.1875228. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

5.3 Weibull

5.3.1 Estimates of parameters

According to the feature of Weibull distribution.E[X]= E(X) = b Γ(1 + 1/a) and V[X]=b^2 * (Γ(1 + 2/a) - (Γ(1 + 1/a))^2). These are not very good representations, we constructed the following equation.

#Define mean of Weibull Distribution
mean.weib = function(lambda,k){
  lambda*gamma(1+1/k)
}

#Define variance of Weibull Distribution
var.weib = function(lambda,k){
  lambda^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}

#Define lambda
lambda = function(samp_mean,k){
  samp_mean/gamma(1+1/k)
}

#Define mean of Weibull Distribution using sample mean
var.weib = function(samp_mean,k){
  lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}

#Define variance of Weibull Distribution using sample mean
var.weib = function(samp_mean,k,samp_var){
  lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)-samp_var
}

#Build function to calculate two parameters in Weibull Distribution
mm.opt = optimize(f = function(x){
  abs(var.weib(k=x,samp_mean = mean(d1$ht),samp_var = var(d1$ht)))
},lower=10,upper = 100)

mm.weib.k = mm.opt$min

mm.weib.lambda = lambda(samp_mean = mean(d1$ht),k=mm.weib.k)
mm.weib.k
## [1] 27.45942
mm.weib.lambda
## [1] 163.9807

This is the specific value of the two parameters.

5.3.2 Overlay estimated pdf onto histogram

hist(d1$ht,breaks = 100,freq = F)
curve(dweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution.

5.3.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$ht))
curve(pweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution.

5.3.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$ht,qs)
theor_qs <- qweibull(qs,shape=mm.weib.k,scale=mm.weib.lambda)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We maynot say that the sample quantile fits the theoretical quantile very well in the figure.

5.3.5 Estimated Median

qweibull(0.5,shape=mm.weib.k,scale=mm.weib.lambda)
## [1] 161.8065

The estimated Median is 161.8065244. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

5.3.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rweibull(M*N,shape=mm.weib.k,scale=mm.weib.lambda) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

5.3.7 Range of middle 95% of Samp Dist

#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 161.2808 162.3238
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)

The Range of middle 95% of Samp Dist is from 161.2808117 to 162.3238118. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

5.4 Conclusion

In this part, we use MM to model Height of adult females and we use three different distributions. The content above answers questions in Background part.

6 MLE of Glycohemoglobin

6.1 Normal Distribution

6.1.1 Estimates of parameters

#HT,Normal, MLE
ll <- function(params, data)  sum(dnorm(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$gh)
g <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(x[i],1),d1$gh)
  }
  out
}

h <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(5.7,x[i]),d1$gh)
  }
  out
}

In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.

require(stats4)
nLL <- function(mean, sd){
  fs <- dnorm(
        x = d1$gh
      , mean = mean
      , sd = sd
      , log = TRUE
    ) 
  -sum(fs)
}
param_hat <- mle(
    nLL
  , start = list(mean = 5.7, sd = 1)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2)); plot(profile(param_hat), absVal = FALSE)

In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.

coef(param_hat)
##     mean       sd 
## 5.724600 1.051721

This is the specific value of the two parameters.

6.1.2 Overlay estimated pdf onto histogram

hist(d1$gh,breaks = 100, freq = FALSE)
curve(dnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.

6.1.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$gh))
curve(pnorm(x,mean=coef(param_hat)[1],sd=coef(param_hat)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.

6.1.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qnorm(qs,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.

6.1.5 Estimated Median

qnorm(0.5,mean=coef(param_hat)[1],sd=coef(param_hat)[2])
## [1] 5.7246

The estimated Median is 5.7246. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

6.1.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rnorm(M*N,mean=coef(param_hat)[1],sd=coef(param_hat)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

6.1.7 Range of middle 95% of Samp Dist

#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 5.644918 5.807284
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)

The Range of middle 95% of Samp Dist is from 5.6449183 to 5.8072845. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

6.2 Gamma

6.2.1 Estimates of parameters

#HT,Gamma, MLE
ll <- function(params, data)  sum(dgamma(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$gh)
g <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(x[i],0.14),d1$gh)
  }
  out
}

h <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(40,x[i]),d1$gh)
  }
  out
}

In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.

require(stats4)
nLL_gamma <- function(shape, scale){
  fs <- dgamma(
        x = d1$gh
      , shape = shape
      , scale = scale
      , log = TRUE
    ) 
  -sum(fs)
}
param_hat_gamma <- mle(
    nLL_gamma
  , start = list(shape = 40, scale = 0.14)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2)); plot(profile(param_hat_gamma), absVal = FALSE)

In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.

coef(param_hat_gamma)
##      shape      scale 
## 40.7065036  0.1406358

This is the specific value of the two parameters.

6.2.2 Overlay estimated pdf onto histogram

hist(d1$gh,breaks = 100, freq = FALSE)
curve(dgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.

6.2.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$gh))
curve(pgamma(x,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.

6.2.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qgamma(qs,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.

6.2.5 Estimated Median

qgamma(0.5,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2])
## [1] 5.677983

The estimated Median is 5.677983. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

6.2.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rgamma(M*N,shape=coef(param_hat_gamma)[1],scale=coef(param_hat_gamma)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

6.2.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 5.607436 5.747332

The Range of middle 95% of Samp Dist is from 5.6074365 to 5.7473322. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

6.3 Weibull

6.3.1 Estimates of parameters

#HT,Gamma, MLE
ll <- function(params, data)  sum(dweibull(data,params[1],params[2],log=TRUE))
#ll(c(2,3),d1$gh)
g <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(x[i],6.2),d1$gh)
  }
  out
}

h <- function(x) {
  out <- x
  for(i in seq_along(x)){
    out[i] <- ll(c(4.1,x[i]),d1$gh)
  }
  out
}

In this chunk, we build function f and function g. These functions are used to calculte the other parameter when fix one. Using the principle of MLE. They help us adjust the parameters.

require(stats4)
nLL_weibull <- function(shape, scale){
  fs <- dweibull(
        x = d1$gh
      , shape = shape
      , scale = scale
      , log = TRUE
    ) 
  -sum(fs)
}
param_hat_weibull <- mle(
    nLL_weibull
  , start = list(shape = 4.1, scale = 6.2)
  , method = "L-BFGS-B"
  , lower = c(0, 0.01)
)

par(mfrow = c(1,2)); plot(profile(param_hat_weibull), absVal = FALSE)

In this chunk, we use function to calculate the MLE when set two parameters that close to real MLE. From the figure, we can see when z=0, the two parameters are what we need.

coef(param_hat_weibull)
##    shape    scale 
## 4.125254 6.173884

This is the specific value of the two parameters.

6.3.2 Overlay estimated pdf onto histogram

hist(d1$gh,breaks = 100, freq = FALSE)
curve(dweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.

6.3.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$gh))
curve(pweibull(x,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.

6.3.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qweibull(qs,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.

6.3.5 Estimated Median

qweibull(0.5,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2])
## [1] 5.64902

The estimated Median is 5.6490199. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

6.3.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rweibull(M*N,shape=coef(param_hat_weibull)[1],scale=coef(param_hat_weibull)[2]) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

6.3.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 5.526716 5.772334

The Range of middle 95% of Samp Dist is from 5.526716 to 5.772334. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

6.4 Conclusion

In this part, we use MLE to model Glycohemoglobin of adult females and we use three different distributions. The content above answers questions in Background part.

7 MM of Glycohemoglobin

7.1 Normal

7.1.1 Estimates of parameters

According to the feature of normal distribution.E[X]=mean(sample) and V[X]= Var(sample). Thus, we can conclude two parameters below.

mm.mean=mean(d1$gh)
mm.sd=sd(d1$gh)

This is the specific value of the two parameters.

7.1.2 Overlay estimated pdf onto histogram

hist(d1$gh,breaks = 100,freq = F)
curve(dnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.

7.1.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$gh))
curve(pnorm(x,mm.mean,mm.sd),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.

7.1.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qnorm(qs,mm.mean,mm.sd)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.

7.1.5 Estimated Median

qnorm(0.5,mean=mm.mean,sd=mm.sd)
## [1] 5.7246

The estimated Median is 5.7246. We use qnorm to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

7.1.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rnorm(M*N,mean=mm.mean,sd=mm.sd) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

7.1.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 5.645479 5.808015

The Range of middle 95% of Samp Dist is from 5.6454789 to 5.8080146. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

7.2 Gamma

7.2.1 Estimates of parameters

According to the feature of Gamma distribution.E[X]=shape∗scale and V[X]=shape∗scale^2. Thus, we can conclude two parameters below. scale=mean(sample)/var(sample) and shape= mean(sample)^2 / mean(sample).

mm_shape_gamma=mean(d1$gh)^2/var(d1$gh)
mm_scale_gamma=var(d1$gh)/mean(d1$gh)

This is the specific value of the two parameters.

7.2.2 Overlay estimated pdf onto histogram

hist(d1$gh,breaks = 100,freq = F)
curve(dgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-8 interval.

7.2.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$gh))
curve(pgamma(x,shape=mm_shape_gamma,scale=mm_scale_gamma),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.

7.2.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qgamma(qs,shape=mm_shape_gamma,scale=mm_scale_gamma)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.

7.2.5 Estimated Median

qgamma(0.5,shape=mm_shape_gamma,scale=mm_scale_gamma)
## [1] 5.660259

The estimated Median is 5.6602591. We use qgamma to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

7.2.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rgamma(M*N,shape=mm_shape_gamma,scale=mm_scale_gamma) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

7.2.7 Range of middle 95% of Samp Dist

quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 5.579084 5.744042

The Range of middle 95% of Samp Dist is from 5.5790839 to 5.7440416. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

7.3 Weibull

7.3.1 Estimates of parameters

According to the feature of Weibull distribution.E[X]= E(X) = b Γ(1 + 1/a) and V[X]=b^2 * (Γ(1 + 2/a) - (Γ(1 + 1/a))^2). These are not very good representations, we constructed the following equation.

#Define mean of Weibull Distribution
mean.weib = function(lambda,k){
  lambda*gamma(1+1/k)
}

#Define variance of Weibull Distribution
var.weib = function(lambda,k){
  lambda^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}

#Define lambda
lambda = function(samp_mean,k){
  samp_mean/gamma(1+1/k)
}

#Define mean of Weibull Distribution using sample mean
var.weib = function(samp_mean,k){
  lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)
}

#Define variance of Weibull Distribution using sample mean
var.weib = function(samp_mean,k,samp_var){
  lambda(samp_mean,k)^2*(gamma(1+2/k)-(gamma(1+1/k))^2)-samp_var
}

#Build function to calculate two parameters in Weibull Distribution
mm.opt = optimize(f = function(x){
  abs(var.weib(k=x,samp_mean = mean(d1$gh),samp_var = var(d1$gh)))
},lower=10,upper = 100)

mm.weib.k = mm.opt$min

mm.weib.lambda = lambda(samp_mean = mean(d1$gh),k=mm.weib.k)
mm.weib.k
## [1] 10.00008
mm.weib.lambda
## [1] 6.017337

This is the specific value of the two parameters.

7.3.2 Overlay estimated pdf onto histogram

hist(d1$gh,breaks = 100,freq = F)
curve(dweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)

The figure above is that overlay estimated pdf onto histogram of the empirical distribution. It can be seen that their distribution is very concentrated in the 4-7 interval

7.3.3 Overlay estimated CDF onto eCDF

plot(ecdf(d1$gh))
curve(pweibull(x,shape=mm.weib.k,scale=mm.weib.lambda),col="red",lwd=6,add=T)

The figure above is that overlay estimated CDF onto eCDF of the empirical distribution. The figure does not fit well, at least not as good as the height. It may be because the distribution is too concentrated. Small disturbances have big consequences.

7.3.4 QQ plot (sample vs estimated dist)

qs <- seq(0.05,0.95,length=50)
sample_qs <- quantile(d1$gh,qs)
theor_qs <- qweibull(qs,shape=mm.weib.k,scale=mm.weib.lambda)
plot(sample_qs,theor_qs,pch=16)
abline(0,1)

Sample max doesn’t fit a straight line with y=x perfectly. We cannot say that the sample quantile fits the theoretical quantile very well.

7.3.5 Estimated Median

qweibull(0.5,shape=mm.weib.k,scale=mm.weib.lambda)
## [1] 5.800788

The estimated Median is 5.8007881. We use qweibull to select 50% of the observations, 50% of the observations are less than it, and 50% of the observations are greater than it. So this is the median of the sample.

7.3.6 Median Samp Dist (hist)

M <- 5000
N <- 1000
out <- rweibull(M*N,shape=mm.weib.k,scale=mm.weib.lambda) %>% array( dim=c(M,N))
sample_dist <- apply(out,1,median)
hist(sample_dist, breaks = 100)

The figure above is that median Samp Dist (hist). We simulated 5000 experiments, each with 1000 observations, so we take the median of these 5000 experiments. Draw as histogram.

7.3.7 Range of middle 95% of Samp Dist

#hist(sample_dist, breaks = 100)
quantile(sample_dist,c(0.05/2, 1 - 0.05/2))
##     2.5%    97.5% 
## 5.748073 5.850781
#abline(v = quantile(sample_dist,c(0.05/2, 1 - 0.05/2)), col = "blue", lty = 2)

The Range of middle 95% of Samp Dist is from 5.748073 to 5.8507811. This is the 95% confidence interval. We have 95% confidence that median will be within this range. This has an indicator effect on statistical inference.

7.4 Conclusion

In this part, we use MM to model Glycohemoglobin of adult females and we use three different distributions. The content above answers questions in Background part.